Project Phase 0 - Problem Scoping & Problem Description¶

The problem is to determine whether or not a website is legitimate or used for phishing. Phishing websites are fraudulent websites designed to deceive users and steal their sensitive information, such as login credentials or financial details.¶

Phishing impacts everyone who uses the internet, but certain groups have a greater interest in addressing phishing. Businesses that operate online have an interest in identifying and blocking phishing websites to safeguard their customers' data and maintain their trust. Additionally, law enforcement agencies that focus on cybercrimes like phishing would benefit from the early detection of phishing websites. Finally, internet users are stakeholders because they rely on the accuracy of detection systems to protect themselves from falling victim to online scams¶

Phishing websites often use complex and evolving tactics. Machine learning models can identify patterns and anomalies that may not be apparent with traditional methods. Furthermore, machine learning models can adapt to changing phishing techniques and evolving threats. Finally machine learning leverages historical data to make predictions, allowing for the identification of new and emerging phishing websites. Despite the advantages, machine learning models may still generate false positives and negatives, and adjustments are necessary to correct these errors¶

Research Questions:¶

1. How can we minimize false positives while maintaining a high level of true positives in the detection process?¶

2. How can we ensure that the model generalizes well to new phishing tactics?¶

3. What are the most important features for distinguishing between legitimate and phishing websites?¶

Project Phase I - Data Preprocessing¶

Name: Leonardo Cacho and Nicholai Espenido¶

Email: lmc6615@psu.edu and nje5226@psu.edu¶

Contributions: Imputation Leo, finding and removing outliers Leo, assignment submission Leo, normalization Leo, justifications NJ, column deletion NJ, encoding NJ¶

Preparation¶

We started by importing the data and necessary libraries that we used throughout the project.¶

In [196]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

df=pd.read_csv('Phishing_Legitimate_train_missing_data (2).csv',index_col='id',na_values=['',' ','n/a','null'])
df.head()
Out[196]:
NumDots SubdomainLevel PathLevel UrlLength NumDash NumDashInHostname AtSymbol TildeSymbol NumUnderscore NumPercent ... InsecureForms RelativeFormAction ExtFormAction AbnormalFormAction RightClickDisabled PopUpWindow IframeOrFrame MissingTitle ImagesOnlyInForm CLASS_LABEL
id
1 3.0 1.0 5.0 81.0 1.0 1.0 0.0 0.0 1.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
2 2.0 0.0 5.0 78.0 1.0 1.0 0.0 0.0 3.0 0.0 ... 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
3 3.0 0.0 4.0 53.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1
4 3.0 1.0 6.0 68.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
5 3.0 0.0 3.0 61.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1

5 rows × 38 columns

In [197]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5000 entries, 1 to 5000
Data columns (total 38 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   NumDots             4999 non-null   float64
 1   SubdomainLevel      4999 non-null   float64
 2   PathLevel           4999 non-null   float64
 3   UrlLength           4994 non-null   float64
 4   NumDash             4998 non-null   float64
 5   NumDashInHostname   4996 non-null   float64
 6   AtSymbol            4997 non-null   float64
 7   TildeSymbol         4997 non-null   float64
 8   NumUnderscore       4998 non-null   float64
 9   NumPercent          4997 non-null   float64
 10  NumQueryComponents  4996 non-null   float64
 11  NumAmpersand        4997 non-null   float64
 12  NumHash             4995 non-null   float64
 13  NumNumericChars     4996 non-null   float64
 14  NoHttps             4998 non-null   float64
 15  RandomString        4998 non-null   float64
 16  IpAddress           4998 non-null   float64
 17  DomainInSubdomains  4998 non-null   float64
 18  DomainInPaths       4997 non-null   float64
 19  HttpsInHostname     4998 non-null   float64
 20  HostnameLength      4993 non-null   float64
 21  PathLength          4994 non-null   float64
 22  QueryLength         4994 non-null   float64
 23  DoubleSlashInPath   4995 non-null   float64
 24  NumSensitiveWords   4995 non-null   float64
 25  EmbeddedBrandName   4995 non-null   float64
 26  PctExtResourceUrls  4992 non-null   float64
 27  ExtFavicon          4995 non-null   float64
 28  InsecureForms       4995 non-null   float64
 29  RelativeFormAction  4995 non-null   float64
 30  ExtFormAction       4995 non-null   float64
 31  AbnormalFormAction  4995 non-null   float64
 32  RightClickDisabled  4995 non-null   float64
 33  PopUpWindow         4995 non-null   float64
 34  IframeOrFrame       4995 non-null   float64
 35  MissingTitle        4995 non-null   float64
 36  ImagesOnlyInForm    4995 non-null   float64
 37  CLASS_LABEL         5000 non-null   int64  
dtypes: float64(37), int64(1)
memory usage: 1.5 MB

Identifying and Removing Missing Values¶

After importing the data, we looked for rows with missing values. Then we decided to remove rows which lacked values for more than two features. Removing rows with missing values is important because missing values can significantly harm the accuracy of our machine learning model.¶

In [198]:
missing_rows = df[df.isnull().any(axis=1)]
print(missing_rows)
      NumDots  SubdomainLevel  PathLevel  UrlLength  NumDash  \
id                                                             
7         NaN             0.0        1.0        NaN      0.0   
23        2.0             0.0        1.0        NaN     12.0   
27        4.0             1.0        3.0       72.0      0.0   
145       1.0             0.0        6.0        NaN     12.0   
150       3.0             1.0        3.0       56.0      0.0   
419       3.0             1.0        5.0       73.0      1.0   
831       1.0             0.0        0.0       30.0      1.0   
903       2.0             NaN        NaN        NaN      NaN   
980       2.0             0.0        3.0       52.0      1.0   
1011      2.0             1.0        3.0       64.0      1.0   
1015      3.0             1.0        4.0      101.0     10.0   
1236      1.0             0.0        5.0        NaN     12.0   
1238      2.0             1.0        2.0       36.0      1.0   
1275      7.0             2.0        2.0      206.0     55.0   
1313      2.0             0.0        5.0       60.0      0.0   
1760      1.0             0.0        5.0       82.0      NaN   
1821      2.0             0.0        5.0       70.0      1.0   
2776      2.0             0.0        4.0       73.0      1.0   
2777      2.0             1.0        0.0       25.0      0.0   
2778      2.0             1.0        3.0       42.0      0.0   
2779      2.0             0.0        7.0       86.0      0.0   
2780      3.0             1.0        2.0       72.0      4.0   
4178      2.0             1.0        3.0      118.0      0.0   
4179      3.0             1.0        4.0       53.0      0.0   
4558      4.0             1.0        4.0       70.0      1.0   
4898      4.0             0.0        1.0      213.0      2.0   
4912      4.0             1.0        4.0       60.0      0.0   
4963      1.0             0.0        5.0       95.0      8.0   
4973      2.0             0.0        1.0       62.0      0.0   
4988      3.0             1.0        1.0        NaN      9.0   

      NumDashInHostname  AtSymbol  TildeSymbol  NumUnderscore  NumPercent  \
id                                                                          
7                   0.0       0.0          0.0            2.0         0.0   
23                  0.0       0.0          0.0            0.0         0.0   
27                  0.0       0.0          0.0            0.0         0.0   
145                 0.0       0.0          0.0            0.0         0.0   
150                 0.0       0.0          0.0            0.0         0.0   
419                 0.0       0.0          0.0            0.0         0.0   
831                 1.0       0.0          0.0            0.0         0.0   
903                 NaN       NaN          NaN            0.0         0.0   
980                 NaN       NaN          NaN            NaN         NaN   
1011                0.0       0.0          1.0            0.0         0.0   
1015                0.0       0.0          0.0            0.0         0.0   
1236                0.0       0.0          0.0            0.0         0.0   
1238                0.0       0.0          0.0            0.0         0.0   
1275                0.0       0.0          0.0            0.0         0.0   
1313                NaN       0.0          0.0            0.0         0.0   
1760                NaN       NaN          NaN            NaN         NaN   
1821                0.0       0.0          0.0            0.0         0.0   
2776                0.0       0.0          0.0            0.0         0.0   
2777                0.0       0.0          0.0            0.0         0.0   
2778                0.0       0.0          0.0            0.0         0.0   
2779                0.0       0.0          0.0            0.0         0.0   
2780                0.0       0.0          0.0            0.0         0.0   
4178                0.0       0.0          0.0            0.0         0.0   
4179                0.0       0.0          0.0            0.0         0.0   
4558                0.0       0.0          0.0            0.0         0.0   
4898                0.0       0.0          0.0            2.0         NaN   
4912                0.0       0.0          0.0            0.0         0.0   
4963                0.0       0.0          0.0            0.0         0.0   
4973                0.0       0.0          0.0            0.0         0.0   
4988                0.0       0.0          0.0            2.0         0.0   

      ...  InsecureForms  RelativeFormAction  ExtFormAction  \
id    ...                                                     
7     ...            1.0                 0.0            0.0   
23    ...            1.0                 0.0            0.0   
27    ...            1.0                 0.0            0.0   
145   ...            1.0                 0.0            0.0   
150   ...            1.0                 0.0            0.0   
419   ...            1.0                 0.0            0.0   
831   ...            0.0                 0.0            0.0   
903   ...            0.0                 0.0            0.0   
980   ...            1.0                 0.0            0.0   
1011  ...            1.0                 0.0            0.0   
1015  ...            1.0                 0.0            0.0   
1236  ...            0.0                 0.0            0.0   
1238  ...            1.0                 0.0            0.0   
1275  ...            0.0                 0.0            0.0   
1313  ...            1.0                 0.0            0.0   
1760  ...            1.0                 0.0            0.0   
1821  ...            1.0                 0.0            0.0   
2776  ...            NaN                 NaN            NaN   
2777  ...            NaN                 NaN            NaN   
2778  ...            NaN                 NaN            NaN   
2779  ...            NaN                 NaN            NaN   
2780  ...            NaN                 NaN            NaN   
4178  ...            1.0                 0.0            0.0   
4179  ...            1.0                 1.0            0.0   
4558  ...            1.0                 0.0            0.0   
4898  ...            0.0                 0.0            0.0   
4912  ...            1.0                 1.0            0.0   
4963  ...            1.0                 1.0            0.0   
4973  ...            1.0                 1.0            0.0   
4988  ...            0.0                 0.0            0.0   

      AbnormalFormAction  RightClickDisabled  PopUpWindow  IframeOrFrame  \
id                                                                         
7                    0.0                 0.0          0.0            0.0   
23                   0.0                 1.0          0.0            1.0   
27                   0.0                 0.0          0.0            0.0   
145                  0.0                 0.0          0.0            1.0   
150                  0.0                 0.0          0.0            1.0   
419                  0.0                 0.0          0.0            0.0   
831                  0.0                 0.0          0.0            0.0   
903                  0.0                 0.0          0.0            1.0   
980                  0.0                 0.0          0.0            0.0   
1011                 0.0                 0.0          0.0            0.0   
1015                 0.0                 0.0          0.0            1.0   
1236                 0.0                 0.0          0.0            0.0   
1238                 0.0                 0.0          0.0            0.0   
1275                 0.0                 0.0          0.0            0.0   
1313                 0.0                 0.0          0.0            0.0   
1760                 0.0                 0.0          0.0            0.0   
1821                 0.0                 0.0          0.0            1.0   
2776                 NaN                 NaN          NaN            NaN   
2777                 NaN                 NaN          NaN            NaN   
2778                 NaN                 NaN          NaN            NaN   
2779                 NaN                 NaN          NaN            NaN   
2780                 NaN                 NaN          NaN            NaN   
4178                 0.0                 0.0          0.0            0.0   
4179                 0.0                 0.0          0.0            0.0   
4558                 0.0                 0.0          0.0            0.0   
4898                 0.0                 0.0          0.0            1.0   
4912                 0.0                 0.0          0.0            0.0   
4963                 0.0                 0.0          0.0            1.0   
4973                 0.0                 0.0          0.0            1.0   
4988                 0.0                 0.0          0.0            1.0   

      MissingTitle  ImagesOnlyInForm  CLASS_LABEL  
id                                                 
7              0.0               0.0            0  
23             0.0               0.0            0  
27             0.0               0.0            1  
145            0.0               0.0            0  
150            0.0               0.0            1  
419            0.0               0.0            1  
831            0.0               0.0            0  
903            0.0               0.0            0  
980            0.0               0.0            1  
1011           0.0               0.0            1  
1015           0.0               0.0            0  
1236           0.0               0.0            0  
1238           0.0               0.0            1  
1275           0.0               0.0            0  
1313           0.0               0.0            1  
1760           0.0               0.0            0  
1821           0.0               0.0            1  
2776           NaN               NaN            1  
2777           NaN               NaN            0  
2778           NaN               NaN            0  
2779           NaN               NaN            0  
2780           NaN               NaN            0  
4178           0.0               0.0            1  
4179           1.0               0.0            1  
4558           0.0               0.0            1  
4898           0.0               0.0            0  
4912           0.0               0.0            0  
4963           0.0               0.0            0  
4973           0.0               0.0            0  
4988           0.0               0.0            0  

[30 rows x 38 columns]

We decided on a threshold of two for the removal of rows because we believed a lower threshold may have created too much data loss. Also, we believed that higher threshold would not have made enough of a positive change to the reliability of our data.¶

In [199]:
rows_to_drop= df[ df.isnull().sum(axis=1) > 2 ].index
print(rows_to_drop)
df.drop(rows_to_drop,inplace=True)
print(df.shape)
Int64Index([7, 903, 980, 1760, 1821, 2776, 2777, 2778, 2779, 2780], dtype='int64', name='id')
(4990, 38)

Then we examined which columns still had missing values after our previous removal. This part is crucial so that we know which columns need imputation.¶

In [200]:
missing_values = df.isna().sum()
print(missing_values)
NumDots               0
SubdomainLevel        0
PathLevel             0
UrlLength             4
NumDash               0
NumDashInHostname     1
AtSymbol              0
TildeSymbol           0
NumUnderscore         0
NumPercent            1
NumQueryComponents    1
NumAmpersand          0
NumHash               2
NumNumericChars       2
NoHttps               0
RandomString          0
IpAddress             0
DomainInSubdomains    0
DomainInPaths         1
HttpsInHostname       0
HostnameLength        5
PathLength            0
QueryLength           0
DoubleSlashInPath     0
NumSensitiveWords     0
EmbeddedBrandName     0
PctExtResourceUrls    3
ExtFavicon            0
InsecureForms         0
RelativeFormAction    0
ExtFormAction         0
AbnormalFormAction    0
RightClickDisabled    0
PopUpWindow           0
IframeOrFrame         0
MissingTitle          0
ImagesOnlyInForm      0
CLASS_LABEL           0
dtype: int64

Imputation¶

Even after removing several rows with more than two missing values, we still had multiple rows lacking values for certain features. To fix this issue, we used imputation to fill in the remaining missing values in our data set.¶

In [201]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=10)
blanks=df[['UrlLength','NumDashInHostname','NumPercent','NumQueryComponents','NumHash','NumNumericChars','DomainInPaths','HostnameLength','PctExtResourceUrls']].to_numpy()
blanks_imputed=imputer.fit_transform(blanks)
print(blanks_imputed)
df[['UrlLength','NumDashInHostname','NumPercent','NumQueryComponents','NumHash','NumNumericChars','DomainInPaths','HostnameLength','PctExtResourceUrls']]=blanks_imputed
print(df.head(50))
[[81.          1.          0.         ...  1.         29.
   0.        ]
 [78.          1.          0.         ...  0.         13.
   1.        ]
 [53.          0.          0.         ...  0.         16.
   1.        ]
 ...
 [33.          0.          0.         ...  0.         25.
   0.64705882]
 [47.          0.          0.         ...  0.         15.
   0.        ]
 [37.          0.          0.         ...  0.         17.
   0.14285714]]
    NumDots  SubdomainLevel  PathLevel  UrlLength  NumDash  NumDashInHostname  \
id                                                                              
1       3.0             1.0        5.0       81.0      1.0                1.0   
2       2.0             0.0        5.0       78.0      1.0                1.0   
3       3.0             0.0        4.0       53.0      1.0                0.0   
4       3.0             1.0        6.0       68.0      0.0                0.0   
5       3.0             0.0        3.0       61.0      0.0                0.0   
6       3.0             1.0        2.0       55.0      0.0                0.0   
8       2.0             0.0        1.0       63.0      0.0                0.0   
9       3.0             0.0        3.0       58.0      0.0                0.0   
10      2.0             0.0        4.0       52.0      0.0                0.0   
11      4.0             1.0        5.0       75.0      2.0                2.0   
12      1.0             0.0        5.0       61.0      4.0                0.0   
13      2.0             0.0        2.0       39.0      0.0                0.0   
14      2.0             1.0        3.0       50.0      1.0                0.0   
15      1.0             0.0        4.0      148.0     14.0                0.0   
16      1.0             0.0        3.0       46.0      1.0                0.0   
17      2.0             1.0        2.0       52.0      1.0                1.0   
18      2.0             0.0        6.0       89.0      0.0                0.0   
19      4.0             1.0        3.0       53.0      1.0                0.0   
20      2.0             1.0        4.0       76.0      0.0                0.0   
21      2.0             0.0        2.0      100.0      1.0                0.0   
22      2.0             0.0        4.0       48.0      0.0                0.0   
23      2.0             0.0        1.0       76.7     12.0                0.0   
24      1.0             0.0        2.0       83.0      7.0                0.0   
25      2.0             0.0        3.0       80.0      0.0                0.0   
26      2.0             0.0        4.0       50.0      1.0                0.0   
27      4.0             1.0        3.0       72.0      0.0                0.0   
28      2.0             1.0        0.0       20.0      0.0                0.0   
29      3.0             2.0        2.0       74.0      6.0                0.0   
30      1.0             0.0        5.0       75.0      9.0                0.0   
31      3.0             1.0        3.0       48.0      1.0                1.0   
32      2.0             0.0        5.0       66.0      1.0                0.0   
33      2.0             1.0        0.0       23.0      0.0                0.0   
34      2.0             0.0        3.0       52.0      3.0                2.0   
35      3.0             1.0        2.0       56.0    100.0                0.0   
36      2.0             1.0        2.0       39.0      2.0                0.0   
37      4.0             0.0        5.0       87.0      1.0                0.0   
38      1.0             0.0        4.0       52.0      1.0                0.0   
39      2.0             0.0        3.0       70.0      5.0                0.0   
40      2.0             1.0        2.0       53.0      0.0                0.0   
41      1.0             0.0        0.0       37.0      0.0                0.0   
42      2.0             0.0        5.0       61.0      0.0                0.0   
43      2.0             1.0        2.0       71.0      3.0                1.0   
44      2.0             0.0        2.0       52.0      0.0                0.0   
45      3.0             1.0        5.0       58.0      2.0                1.0   
46      2.0             1.0        4.0       48.0      0.0                0.0   
47      2.0             0.0        1.0       80.0      5.0                0.0   
48      1.0             0.0        5.0       98.0     11.0                0.0   
49      2.0             1.0        2.0       45.0      2.0                0.0   
50      2.0             0.0        3.0      115.0      0.0                0.0   
51      6.0             1.0        6.0      103.0      0.0                0.0   

    AtSymbol  TildeSymbol  NumUnderscore  NumPercent  ...  InsecureForms  \
id                                                    ...                  
1        0.0          0.0            1.0         0.0  ...            1.0   
2        0.0          0.0            3.0         0.0  ...            1.0   
3        0.0          0.0            0.0         0.0  ...            1.0   
4        0.0          0.0            0.0         0.0  ...            1.0   
5        0.0          0.0            0.0         0.0  ...            1.0   
6        0.0          0.0            1.0         0.0  ...            0.0   
8        0.0          0.0            0.0         0.0  ...            1.0   
9        0.0          0.0            0.0         0.0  ...            1.0   
10       0.0          0.0            0.0         0.0  ...            1.0   
11       0.0          0.0            0.0         0.0  ...            1.0   
12       0.0          0.0            0.0         0.0  ...            0.0   
13       0.0          0.0            0.0         0.0  ...            1.0   
14       0.0          0.0            0.0         0.0  ...            1.0   
15       0.0          0.0            0.0         0.0  ...            1.0   
16       0.0          0.0            0.0         0.0  ...            1.0   
17       0.0          0.0            0.0         0.0  ...            0.0   
18       0.0          0.0            1.0         0.0  ...            1.0   
19       0.0          0.0            0.0         0.0  ...            1.0   
20       0.0          0.0            0.0         0.0  ...            0.0   
21       0.0          0.0            5.0         0.0  ...            1.0   
22       0.0          0.0            0.0         0.0  ...            1.0   
23       0.0          0.0            0.0         0.0  ...            1.0   
24       0.0          0.0            0.0         0.0  ...            1.0   
25       0.0          0.0            1.0         0.0  ...            0.0   
26       0.0          0.0            0.0         0.0  ...            1.0   
27       0.0          0.0            0.0         0.0  ...            1.0   
28       0.0          0.0            0.0         0.0  ...            1.0   
29       0.0          0.0            0.0         0.0  ...            1.0   
30       0.0          0.0            0.0         0.0  ...            1.0   
31       0.0          0.0            0.0         0.0  ...            1.0   
32       0.0          0.0            1.0         0.0  ...            1.0   
33       0.0          0.0            0.0         0.0  ...            0.0   
34       0.0          0.0            0.0         0.0  ...            1.0   
35       0.0          0.0            0.0         0.0  ...            1.0   
36       0.0          0.0            0.0         0.0  ...            1.0   
37       0.0          0.0            0.0         0.0  ...            1.0   
38       0.0          0.0            0.0         0.0  ...            1.0   
39       0.0          0.0            0.0         0.0  ...            1.0   
40       0.0          0.0            0.0         2.0  ...            1.0   
41       0.0          0.0            2.0         0.0  ...            0.0   
42       0.0          0.0            0.0         0.0  ...            1.0   
43       0.0          0.0            0.0         0.0  ...            1.0   
44       0.0          0.0            0.0         0.0  ...            1.0   
45       0.0          0.0            0.0         0.0  ...            1.0   
46       0.0          0.0            0.0         0.0  ...            1.0   
47       0.0          0.0            0.0         0.0  ...            1.0   
48       0.0          0.0            0.0         0.0  ...            1.0   
49       0.0          0.0            0.0         0.0  ...            0.0   
50       0.0          0.0            0.0         1.0  ...            1.0   
51       0.0          0.0            0.0         0.0  ...            1.0   

    RelativeFormAction  ExtFormAction  AbnormalFormAction  RightClickDisabled  \
id                                                                              
1                  0.0            0.0                 0.0                 0.0   
2                  1.0            0.0                 0.0                 0.0   
3                  0.0            1.0                 0.0                 0.0   
4                  0.0            0.0                 0.0                 0.0   
5                  0.0            0.0                 0.0                 0.0   
6                  0.0            0.0                 0.0                 0.0   
8                  0.0            0.0                 0.0                 0.0   
9                  1.0            0.0                 0.0                 0.0   
10                 1.0            0.0                 0.0                 0.0   
11                 1.0            0.0                 0.0                 0.0   
12                 0.0            0.0                 0.0                 0.0   
13                 0.0            0.0                 0.0                 0.0   
14                 0.0            1.0                 0.0                 0.0   
15                 0.0            0.0                 0.0                 0.0   
16                 0.0            0.0                 0.0                 0.0   
17                 1.0            0.0                 0.0                 0.0   
18                 0.0            0.0                 0.0                 0.0   
19                 0.0            1.0                 0.0                 0.0   
20                 0.0            0.0                 0.0                 0.0   
21                 1.0            0.0                 0.0                 0.0   
22                 0.0            0.0                 0.0                 0.0   
23                 0.0            0.0                 0.0                 1.0   
24                 0.0            0.0                 0.0                 0.0   
25                 0.0            1.0                 0.0                 0.0   
26                 0.0            0.0                 0.0                 0.0   
27                 0.0            0.0                 0.0                 0.0   
28                 0.0            0.0                 0.0                 0.0   
29                 0.0            0.0                 0.0                 0.0   
30                 1.0            1.0                 0.0                 0.0   
31                 1.0            0.0                 0.0                 0.0   
32                 0.0            0.0                 0.0                 0.0   
33                 0.0            0.0                 0.0                 0.0   
34                 0.0            0.0                 0.0                 0.0   
35                 1.0            0.0                 1.0                 0.0   
36                 0.0            0.0                 0.0                 0.0   
37                 0.0            0.0                 0.0                 0.0   
38                 0.0            0.0                 0.0                 0.0   
39                 0.0            1.0                 0.0                 0.0   
40                 1.0            0.0                 0.0                 0.0   
41                 0.0            0.0                 0.0                 0.0   
42                 1.0            0.0                 0.0                 0.0   
43                 0.0            0.0                 0.0                 0.0   
44                 0.0            0.0                 0.0                 0.0   
45                 0.0            0.0                 0.0                 0.0   
46                 0.0            0.0                 0.0                 0.0   
47                 1.0            0.0                 0.0                 0.0   
48                 0.0            0.0                 0.0                 0.0   
49                 0.0            0.0                 0.0                 0.0   
50                 0.0            0.0                 0.0                 0.0   
51                 1.0            0.0                 0.0                 0.0   

    PopUpWindow  IframeOrFrame  MissingTitle  ImagesOnlyInForm  CLASS_LABEL  
id                                                                           
1           0.0            0.0           0.0               0.0            0  
2           0.0            0.0           0.0               0.0            1  
3           0.0            1.0           0.0               0.0            1  
4           0.0            0.0           0.0               0.0            1  
5           0.0            1.0           0.0               0.0            1  
6           0.0            0.0           1.0               0.0            0  
8           0.0            0.0           0.0               0.0            0  
9           0.0            0.0           1.0               0.0            1  
10          0.0            0.0           0.0               0.0            1  
11          0.0            0.0           0.0               0.0            1  
12          0.0            1.0           0.0               0.0            0  
13          0.0            0.0           0.0               0.0            1  
14          0.0            1.0           0.0               0.0            0  
15          0.0            1.0           0.0               0.0            0  
16          0.0            0.0           0.0               0.0            0  
17          0.0            0.0           0.0               0.0            1  
18          0.0            0.0           1.0               0.0            1  
19          0.0            1.0           0.0               0.0            1  
20          0.0            0.0           1.0               0.0            0  
21          0.0            0.0           0.0               0.0            1  
22          0.0            0.0           0.0               0.0            1  
23          0.0            1.0           0.0               0.0            0  
24          0.0            1.0           0.0               0.0            0  
25          0.0            0.0           0.0               0.0            0  
26          0.0            0.0           0.0               0.0            1  
27          0.0            0.0           0.0               0.0            1  
28          0.0            1.0           0.0               0.0            0  
29          0.0            0.0           0.0               0.0            0  
30          0.0            1.0           0.0               0.0            0  
31          0.0            0.0           0.0               0.0            1  
32          0.0            0.0           0.0               0.0            1  
33          0.0            1.0           0.0               0.0            0  
34          0.0            0.0           0.0               0.0            1  
35          0.0            0.0           0.0               0.0            0  
36          0.0            0.0           0.0               0.0            0  
37          0.0            0.0           0.0               0.0            1  
38          0.0            1.0           0.0               0.0            1  
39          0.0            0.0           0.0               0.0            0  
40          0.0            1.0           0.0               0.0            0  
41          0.0            0.0           0.0               0.0            0  
42          0.0            0.0           0.0               0.0            1  
43          0.0            0.0           0.0               0.0            0  
44          0.0            0.0           0.0               0.0            1  
45          0.0            0.0           0.0               0.0            1  
46          0.0            0.0           0.0               1.0            1  
47          0.0            1.0           0.0               0.0            0  
48          0.0            0.0           0.0               0.0            0  
49          0.0            1.0           1.0               0.0            0  
50          0.0            1.0           0.0               0.0            0  
51          0.0            0.0           0.0               0.0            1  

[50 rows x 38 columns]

Removing Outliers¶

Outliers have the potential to skew our data and harm the accuracy of our model's predictions. Columns with numerical data may contain outliers, so we targeted them in the code below. Any row with outliers in any of these columns will be dropped from the data set to prevent unrealistic bias.¶

In [202]:
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(n_neighbors=20)
x = df[['NumDots','SubdomainLevel','PathLevel','UrlLength','NumDash','NumDashInHostname','NumUnderscore','NumQueryComponents','NumAmpersand','NumNumericChars','HostnameLength','PathLength','QueryLength']].to_numpy()
outlier_label=clf.fit_predict(x)
print(clf.negative_outlier_factor_)
print(clf.offset_)
print(outlier_label)

rows_to_drop= df.iloc[ clf.negative_outlier_factor_ < -1.5].index
print(rows_to_drop)
df.drop(rows_to_drop,inplace=True)
print(df.shape)
df.head
[-1.09411557 -1.09907337 -0.98927867 ... -1.22883689 -0.99270466
 -1.02708277]
-1.5
[1 1 1 ... 1 1 1]
Int64Index([  23,   35,   60,   97,  145,  150,  188,  216,  251,  258,
            ...
            4519, 4597, 4697, 4698, 4718, 4753, 4756, 4763, 4865, 4988],
           dtype='int64', name='id', length=133)
(4857, 38)
Out[202]:
<bound method NDFrame.head of       NumDots  SubdomainLevel  PathLevel  UrlLength  NumDash  \
id                                                             
1         3.0             1.0        5.0       81.0      1.0   
2         2.0             0.0        5.0       78.0      1.0   
3         3.0             0.0        4.0       53.0      1.0   
4         3.0             1.0        6.0       68.0      0.0   
5         3.0             0.0        3.0       61.0      0.0   
...       ...             ...        ...        ...      ...   
4996      3.0             1.0        1.0       67.0      3.0   
4997      1.0             0.0        2.0       36.0      1.0   
4998      3.0             2.0        0.0       33.0      0.0   
4999      3.0             1.0        2.0       47.0      0.0   
5000      1.0             0.0        2.0       37.0      0.0   

      NumDashInHostname  AtSymbol  TildeSymbol  NumUnderscore  NumPercent  \
id                                                                          
1                   1.0       0.0          0.0            1.0         0.0   
2                   1.0       0.0          0.0            3.0         0.0   
3                   0.0       0.0          0.0            0.0         0.0   
4                   0.0       0.0          0.0            0.0         0.0   
5                   0.0       0.0          0.0            0.0         0.0   
...                 ...       ...          ...            ...         ...   
4996                0.0       0.0          0.0            0.0         0.0   
4997                0.0       0.0          0.0            0.0         0.0   
4998                0.0       0.0          0.0            0.0         0.0   
4999                0.0       0.0          0.0            0.0         0.0   
5000                0.0       0.0          0.0            0.0         0.0   

      ...  InsecureForms  RelativeFormAction  ExtFormAction  \
id    ...                                                     
1     ...            1.0                 0.0            0.0   
2     ...            1.0                 1.0            0.0   
3     ...            1.0                 0.0            1.0   
4     ...            1.0                 0.0            0.0   
5     ...            1.0                 0.0            0.0   
...   ...            ...                 ...            ...   
4996  ...            1.0                 0.0            0.0   
4997  ...            0.0                 0.0            0.0   
4998  ...            1.0                 0.0            1.0   
4999  ...            1.0                 1.0            0.0   
5000  ...            1.0                 0.0            0.0   

      AbnormalFormAction  RightClickDisabled  PopUpWindow  IframeOrFrame  \
id                                                                         
1                    0.0                 0.0          0.0            0.0   
2                    0.0                 0.0          0.0            0.0   
3                    0.0                 0.0          0.0            1.0   
4                    0.0                 0.0          0.0            0.0   
5                    0.0                 0.0          0.0            1.0   
...                  ...                 ...          ...            ...   
4996                 0.0                 0.0          0.0            1.0   
4997                 0.0                 0.0          0.0            1.0   
4998                 0.0                 0.0          0.0            1.0   
4999                 0.0                 0.0          0.0            0.0   
5000                 0.0                 0.0          0.0            1.0   

      MissingTitle  ImagesOnlyInForm  CLASS_LABEL  
id                                                 
1              0.0               0.0            0  
2              0.0               0.0            1  
3              0.0               0.0            1  
4              0.0               0.0            1  
5              0.0               0.0            1  
...            ...               ...          ...  
4996           0.0               0.0            0  
4997           0.0               0.0            0  
4998           0.0               0.0            0  
4999           1.0               0.0            1  
5000           0.0               0.0            1  

[4857 rows x 38 columns]>

Encoding¶

After imputation and outlier removal, now we must encode our data. The columns in the code below contain binary data. Encoding turns these binary values in numerical values that our machine learning model can later use.¶

In [ ]:
df=pd.get_dummies(df, columns=['AtSymbol','TildeSymbol','NoHttps', 'RandomString', 'IpAddress', 'DomainInSubdomains', 'DomainInPaths', 'HttpsInHostname', 'DoubleSlashInPath', 'ExtFavicon', 'InsecureForms', 'RelativeFormAction', 'ExtFormAction', 'AbnormalFormAction', 'RightClickDisabled', 'PopUpWindow', 'IframeOrFrame', 'MissingTitle'],drop_first=True)
print(df)
      NumDots  SubdomainLevel  PathLevel  UrlLength  NumDash  \
id                                                             
1         3.0             1.0        5.0       81.0      1.0   
2         2.0             0.0        5.0       78.0      1.0   
3         3.0             0.0        4.0       53.0      1.0   
4         3.0             1.0        6.0       68.0      0.0   
5         3.0             0.0        3.0       61.0      0.0   
...       ...             ...        ...        ...      ...   
4996      3.0             1.0        1.0       67.0      3.0   
4997      1.0             0.0        2.0       36.0      1.0   
4998      3.0             2.0        0.0       33.0      0.0   
4999      3.0             1.0        2.0       47.0      0.0   
5000      1.0             0.0        2.0       37.0      0.0   

      NumDashInHostname  NumUnderscore  NumPercent  NumQueryComponents  \
id                                                                       
1                   1.0            1.0         0.0                 0.0   
2                   1.0            3.0         0.0                 0.0   
3                   0.0            0.0         0.0                 0.0   
4                   0.0            0.0         0.0                 0.0   
5                   0.0            0.0         0.0                 0.0   
...                 ...            ...         ...                 ...   
4996                0.0            0.0         0.0                 0.0   
4997                0.0            0.0         0.0                 0.0   
4998                0.0            0.0         0.0                 0.0   
4999                0.0            0.0         0.0                 0.0   
5000                0.0            0.0         0.0                 0.0   

      NumAmpersand  ...  DoubleSlashInPath_1.0  ExtFavicon_1.0  \
id                  ...                                          
1              0.0  ...                      0               0   
2              0.0  ...                      0               1   
3              0.0  ...                      0               0   
4              0.0  ...                      0               1   
5              0.0  ...                      0               1   
...            ...  ...                    ...             ...   
4996           0.0  ...                      0               0   
4997           0.0  ...                      0               0   
4998           0.0  ...                      0               0   
4999           0.0  ...                      0               0   
5000           0.0  ...                      0               1   

      InsecureForms_1.0  RelativeFormAction_1.0  ExtFormAction_1.0  \
id                                                                   
1                     1                       0                  0   
2                     1                       1                  0   
3                     1                       0                  1   
4                     1                       0                  0   
5                     1                       0                  0   
...                 ...                     ...                ...   
4996                  1                       0                  0   
4997                  0                       0                  0   
4998                  1                       0                  1   
4999                  1                       1                  0   
5000                  1                       0                  0   

      AbnormalFormAction_1.0  RightClickDisabled_1.0  PopUpWindow_1.0  \
id                                                                      
1                          0                       0                0   
2                          0                       0                0   
3                          0                       0                0   
4                          0                       0                0   
5                          0                       0                0   
...                      ...                     ...              ...   
4996                       0                       0                0   
4997                       0                       0                0   
4998                       0                       0                0   
4999                       0                       0                0   
5000                       0                       0                0   

      IframeOrFrame_1.0  MissingTitle_1.0  
id                                         
1                     0                 0  
2                     0                 0  
3                     1                 0  
4                     0                 0  
5                     1                 0  
...                 ...               ...  
4996                  1                 0  
4997                  1                 0  
4998                  1                 0  
4999                  0                 1  
5000                  1                 0  

[4857 rows x 38 columns]

Normalization¶

Finally, scaled our numerical variables so that they fall into the same value range. We chose normalization because our variables are not symmetric. Additionally, normalization prevents features with larger values from dominating our model. Also, our data is easier to interpret now that it is on a common scale.¶

In [ ]:
df['URLCharacteristics']=np.mean(df[['NumDots','UrlLength','NumDash','NumDashInHostname','NumUnderscore','NumQueryComponents','NumAmpersand','NumNumericChars','PathLength','QueryLength']],axis=1)
max_URLCharacteristics=df['URLCharacteristics'].max()
min_URLCharacteristics=df['URLCharacteristics'].min()
df['URLCharacteristics']=(df['URLCharacteristics']-min_URLCharacteristics)/(max_URLCharacteristics-min_URLCharacteristics)
print(df['URLCharacteristics'])
id
1       0.231330
2       0.233151
3       0.123862
4       0.165756
5       0.149362
          ...   
4996    0.162113
4997    0.065574
4998    0.032787
4999    0.102004
5000    0.058288
Name: URLCharacteristics, Length: 4857, dtype: float64

Project Phase II & III¶

Name: Leonardo Cacho¶

Email: lmc6615@psu.edu¶

After cleaning the data set, the next phase focuses on selecting the seven features on which the model will be built. To choose the features, exploratory data analysis is crucial so that the most optimal features are selected. This process consists of data visualization which will inform the usage of recursive feature elimination.¶

Data Visualization¶

Before conduncting data visualization, the features must be organized into categories based on their relevance to one another.¶

Group 1: URL Structure and Length¶

Description: This group focuses on the structure and length-related features of URLs¶

  • NumDots
  • SubdomainLevel
  • PathLevel
  • UrlLength
  • NumDash
  • NumDashInHostname
  • AtSymbol
  • TildeSymbol
  • NumUnderscore
  • NumPercent
  • NumQueryComponents
  • NumAmpersand
  • NumHash
  • NumNumericChars

Group 2: Domain and Hostname¶

Description: Encompassing features related to domain and hostname characteristics, this group includes information about HTTPS, IP address, the presence of specific strings, and hostname length.¶

  • NoHttps
  • RandomString
  • IpAddress
  • DomainInSubdomains
  • DomainInPaths
  • HttpsInHostname
  • HostnameLength

Group 3: Path Characteristics¶

Description: Centered around characteristics of the path in URLs, this group includes features such as path length, query length, and the occurrence of double slashes in the path.¶

  • PathLength
  • QueryLength
  • DoubleSlashInPath

Group 4: Form and Action Attributes¶

Description: Focusing on attributes related to forms and actions, this group comprises features like sensitive words, embedded brand names, and percentages of external resource URLs.¶

  • NumSensitiveWords
  • EmbeddedBrandName
  • PctExtResourceUrls
  • ExtFavicon
  • InsecureForms
  • RelativeFormAction
  • ExtFormAction
  • AbnormalFormAction
  • RightClickDisabled
  • PopUpWindow
  • IframeOrFrame
  • MissingTitle
  • ImagesOnlyInForm

After categorizing the variables, visual analysis can begin. This step involves looking for correlations between the features to decide which features are unnecessary. To find correlations, I used scatter plots and looked for any trends in the data.¶

In [ ]:
columns_to_plot_URL=['CLASS_LABEL', 'NumDots', 'SubdomainLevel', 'PathLevel', 'UrlLength', 'NumDash', 'NumDashInHostname',
                     'AtSymbol', 'TildeSymbol', 'NumUnderscore', 'NumPercent', 'NumQueryComponents',
                     'NumAmpersand', 'NumHash', 'NumNumericChars']
g=sns.PairGrid(df[columns_to_plot_URL], hue='CLASS_LABEL')
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7b72ea229420>

Scatter plots are a great first step because they show the direction in which the data moves. If the data is concentrated at certain point and show similar trends, either positively or negatively, then the features correlate with each other.¶

However, sometimes scatter plots cannot provide enough information to guide a decision. As a result, I also used heatmaps for each group of features to find a numerical value to determine the strength of the possible correlations.¶

In [ ]:
correlation_matrix = df[columns_to_plot_URL].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
Out[ ]:
Text(0.5, 1.0, 'Correlation Heatmap')

As shown above, NumDots, SubdomainLevel, and NumHash are correlated. UrlLength and NumNumericChars are correlated. NumHash, NumQueryComponents, and NumAmpersand are strongly correlated. This heatmap shows where significant correlations occur. This will help determine which features to remove later on.¶

Then, I repeated the scatter plot and heatmap process for each group of features in search of more correlations.¶

In [ ]:
columns_to_plot_DomainHostname=['CLASS_LABEL', 'NoHttps', 'RandomString', 'IpAddress', 'DomainInSubdomains', 'DomainInPaths', 'HttpsInHostname', 'HostnameLength']
g=sns.PairGrid(df[columns_to_plot_DomainHostname], hue='CLASS_LABEL')
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7b73037088e0>
In [ ]:
correlation_matrix = df[columns_to_plot_DomainHostname].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
Out[ ]:
Text(0.5, 1.0, 'Correlation Heatmap')

For the Domain and Hostname group only HostnameLength and DomainInSubdomains are correlated.¶

In [ ]:
columns_to_plot_path=['CLASS_LABEL', 'PathLength', 'QueryLength', 'DoubleSlashInPath']
g=sns.PairGrid(df[columns_to_plot_path], hue='CLASS_LABEL')
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7b72ebc3a170>
In [ ]:
correlation_matrix = df[columns_to_plot_path].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
Out[ ]:
Text(0.5, 1.0, 'Correlation Heatmap')

The path group showed little to no correlation, aside from two weak negative correlations.¶

In [ ]:
columns_to_plot_FormAction=['CLASS_LABEL', 'NumSensitiveWords', 'EmbeddedBrandName', 'PctExtResourceUrls', 'ExtFavicon', 'InsecureForms',
                   'RelativeFormAction', 'ExtFormAction', 'AbnormalFormAction', 'RightClickDisabled', 'PopUpWindow',
                   'IframeOrFrame', 'MissingTitle', 'ImagesOnlyInForm']
g=sns.PairGrid(df[columns_to_plot_FormAction], hue='CLASS_LABEL')
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7b72fee0c460>
In [ ]:
correlation_matrix = df[columns_to_plot_FormAction].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
Out[ ]:
Text(0.5, 1.0, 'Correlation Heatmap')

AbnormalFormAction and RelativeFormAction are correlated. ExtFavicon and PctExtResourceUrls are also correlated.¶

After finding the correlations, I used recursive feature elimination to determine which features to remove.¶

In [ ]:
columns_to_select_URL=['NumDots', 'PathLevel', 'UrlLength', 'NumDash', 'NumDashInHostname',
                       'AtSymbol', 'TildeSymbol', 'NumUnderscore', 'NumPercent',
                       'NumNumericChars']

rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select = 3, step = 1)
rfe_selector.fit(df[columns_to_select_URL], df['CLASS_LABEL'])
print(rfe_selector.get_support())
df[columns_to_select_URL].columns[ rfe_selector.get_support() ]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-90b218493e60> in <cell line: 5>()
      3                        'NumNumericChars']
      4 
----> 5 rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select = 3, step = 1)
      6 rfe_selector.fit(df[columns_to_select_URL], df['CLASS_LABEL'])
      7 print(rfe_selector.get_support())

NameError: name 'RFE' is not defined

As a reminder, NumDots, SubdomainLevel, and NumHash are correlated. UrlLength and NumNumericChars are correlated. NumHash, NumQueryComponents, and NumAmpersand are strongly correlated. NumHash, NumAmpersand, NumQueryComponents, and SubdomainLevel will be omitted because they are noticeably correlated with each other. This logic will guide how the remaining groups are handled.¶

In [ ]:
columns_to_select_DomainHostname=['NoHttps', 'RandomString', 'IpAddress', 'DomainInSubdomains', 'DomainInPaths', 'HttpsInHostname']

rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select = 3, step = 1)
rfe_selector.fit(df[columns_to_select_DomainHostname], df['CLASS_LABEL'])
print(rfe_selector.get_support())
df[columns_to_select_DomainHostname].columns[ rfe_selector.get_support() ]
[ True False  True  True False False]
Out[ ]:
Index(['NoHttps', 'IpAddress', 'DomainInSubdomains'], dtype='object')

Only HostnameLength and DomainInSubdomains are correlated. HostnameLength will be removed.¶

In [ ]:
columns_to_select_path=['PathLength', 'QueryLength', 'DoubleSlashInPath']

rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select = 1, step = 1)
rfe_selector.fit(df[columns_to_select_path], df['CLASS_LABEL'])
print(rfe_selector.get_support())
df[columns_to_select_path].columns[ rfe_selector.get_support() ]
[False False  True]
Out[ ]:
Index(['DoubleSlashInPath'], dtype='object')

Due to the weak correlations and small size of the path group, no features were removed beforehand. To compensate for this, n_features_to_select was set to one instead of three.¶

In [ ]:
columns_to_select_FormAction=['NumSensitiveWords', 'EmbeddedBrandName', 'PctExtResourceUrls', 'InsecureForms',
                   'ExtFormAction', 'AbnormalFormAction', 'RightClickDisabled', 'PopUpWindow',
                   'IframeOrFrame', 'MissingTitle', 'ImagesOnlyInForm']

rfe_selector = RFE(estimator=LogisticRegression(),n_features_to_select = 3, step = 1)
rfe_selector.fit(df[columns_to_select_FormAction], df['CLASS_LABEL'])
print(rfe_selector.get_support())
df[columns_to_select_FormAction].columns[ rfe_selector.get_support() ]
[False False False  True False False False  True False  True False]
Out[ ]:
Index(['InsecureForms', 'PopUpWindow', 'MissingTitle'], dtype='object')

AbnormalFormAction and RelativeFormAction are correlated. ExtFavicon and PctExtResourceUrls are also correlated. RelativeFormAction and ExtFavicon will be removed¶

Conclusion¶

After using recursive feature elimination, only 10 features remain. This means that three more features must go.¶

  • 'NumDashInHostname'
  • 'AtSymbol'
  • 'TildeSymbol'
  • 'NoHttps'
  • 'IpAddress'
  • 'DomainInSubdomains'
  • 'DoubleSlashInPath'
  • 'InsecureForms'
  • 'PopUpWindow'
  • 'MissingTitle'

Further research suggests that the 'TildeSymbol' does not provide significant discriminatory power in distinguishing between legitimate and malicious URLs. Also, a 'MissingTitle' is not an automatic sign of a phishing site. Although website titles are essential, a missing title could just be the result of poor web design. Finally 'NumDashInHostname' also seems somewhat irrelevant as I found no evidence that it is associated with phishing sites. However, sources vary on whether or not to include them, but these are minor concerns unrelated to phishing.¶

This is the final list of the seven features for the model.¶

  • 'AtSymbol'
  • 'NoHttps'
  • 'IpAddress'
  • 'DomainInSubdomains'
  • 'DoubleSlashInPath'
  • 'InsecureForms'
  • 'PopUpWindow'

Project Phase IV - Exploratory Data Analysis & Feature selection Using Decision Trees¶

Name: Leonardo Cacho¶

Email: lmc6615@psu.edu¶

Phase II & III focused on selecting features on which to build the model. Now, I must verify my choices using a decision tree. This process will provide a more indepth look at the data and either confirm or disprove the results of the last phase. The decision tree will help visualize and guide my predictions. The goal here revolves around finding the most important features. To accomplish this, I made a decision tree and I used the featureimportances attribute of the DecisionTreeClassifier. This attribute provides a relative measure of the importance of each feature when making predictions.¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
In [ ]:
X=df.drop("CLASS_LABEL", axis=1)
Y=df["CLASS_LABEL"]

clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(X, Y)

Y_predicted=clf.predict(X)

plt.figure(figsize=(14, 7))
tree.plot_tree(clf.fit(X,Y),filled=True,)

feature_importances = clf.feature_importances_

feature_importance_df = pd.DataFrame({'Feature': X.columns, 'Importance': feature_importances})

feature_importance_df = feature_importance_df[feature_importance_df['Importance'] != 0]

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(feature_importance_df)
Feature Importances:
               Feature  Importance
4              NumDash    0.419380
28       InsecureForms    0.234885
10  NumQueryComponents    0.191280
5    NumDashInHostname    0.079934
14             NoHttps    0.038053
26  PctExtResourceUrls    0.032394
0              NumDots    0.004073

To make the output less cluttered, I omitted any features with an importance value of zero. Out of all the features, NumDash, InsecureForms, and NumQueryComponents came out as the most important features (in that order). Phishers often use URL obfuscation techniques, and the presence of an unusually high or low number of dashes might be indicative of phishing attempts, this may explain the high importance of NumDash. Phishing websites often employ deceptive forms to collect sensitive information, such as usernames, passwords, or credit card details. These forms may lack encryption or exhibit other insecure characteristics so it makes sense that InsecureForms is an important feature. The high importance of NumQueryComponents may stem from the fact that phishing websites may manipulate URLs by including a large number of query parameters to obfuscate their true intent or to imitate legitimate websites.¶

Although these new results somewhat confirm my findings from the last phase, they also disprove some of my assumptions. I correctly asserted that InsecureForms would be signficant, but I failed to do the same for NumDash and NumQueryComponents. Going forward, I will focus more on these three features.¶

The confidence in this assessment can be increased by determining the quality of the decision tree. First, we can visualize the decision tree's performance with a confusion matrix. This will show how many true positives and true negatives there are in comparison to false positives and false negatives. In the confusion matrix below, generally the tree is correct with its predictions. Furthermore, it generates more false positives than false negatives. But visualization is only the start, by looking at performance metrics like accuracy, precision, recall, and f1 scores, confidence can be further increased.¶

In [ ]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(Y, Y_predicted, labels=clf.classes_)

disp=ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=clf.classes_)

disp.plot()
Out[ ]:
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x7c269b116770>

In the output below, I found that the current model is accurate around 80.6% of the time. Precision is the ratio of correctly predicted positive observations to the total predicted positives. High precision indicates that when the model predicts a certain class, it is correct most of the time. Also high recall indicates that the model is good at capturing instances of the actual positive class. Although these metrics are promising, they can be improved.¶

In [ ]:
from sklearn.metrics import accuracy_score

from sklearn.metrics import precision_score

from sklearn.metrics import recall_score

from sklearn.metrics import f1_score

ac=accuracy_score(Y, Y_predicted)

print(ac)

pre=precision_score(Y, Y_predicted,average=None)
print(pre)

recall=recall_score(Y, Y_predicted,average=None)
print(recall)

f1 = f1_score(Y, Y_predicted, average=None)
print(f1)
0.8060531192093885
[0.85755396 0.76731602]
[0.73489519 0.87747525]
[0.79150066 0.8187067 ]

By altering the parameters of the decision tree, I can raise confidence in its findings by improving its performance. To find the optimal parameters, I used grid search and based its performance on the AUC score.¶

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
print(min_samples_splits)

max_depths = [None, 5, 10, 20]
print(max_depths)

tuned_parameters = [{'min_samples_split': min_samples_splits, 'max_depth': max_depths}]

base_model = DecisionTreeClassifier()

clf = GridSearchCV(estimator=base_model, param_grid=tuned_parameters, cv=5, verbose=3, scoring='roc_auc')

clf.fit(X, Y)

print("Grid Search Results:")
print(clf.cv_results_)

best_params = clf.best_params_
print("Best Parameters:", best_params)

mean_test_scores = clf.cv_results_['mean_test_score']
print("Mean Test Scores:", mean_test_scores)

best_model = clf.best_estimator_

y_pred_proba = best_model.predict_proba(X)[:, 1]
best_model_auc = roc_auc_score(Y, y_pred_proba)
print("AUC for the Best Model:", best_model_auc)
[0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
[None, 5, 10, 20]
Fitting 5 folds for each of 40 candidates, totalling 200 fits
[CV 1/5] END max_depth=None, min_samples_split=0.1;, score=0.937 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.1;, score=0.947 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.1;, score=0.935 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.1;, score=0.926 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.1;, score=0.937 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.2;, score=0.916 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.2;, score=0.914 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.2;, score=0.897 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.2;, score=0.881 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.2;, score=0.895 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.30000000000000004;, score=0.880 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.30000000000000004;, score=0.876 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.30000000000000004;, score=0.870 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.4;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.4;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.4;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.4;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.4;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.5;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.5;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.5;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.5;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.5;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.6;, score=0.797 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.6;, score=0.813 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.6;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.7000000000000001;, score=0.750 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.7000000000000001;, score=0.753 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.7000000000000001;, score=0.745 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.7000000000000001;, score=0.728 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.7000000000000001;, score=0.752 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.8;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.8;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.8;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.8;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.8;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=0.9;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=0.9;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=0.9;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=0.9;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=0.9;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=None, min_samples_split=1.0;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=None, min_samples_split=1.0;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=None, min_samples_split=1.0;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=None, min_samples_split=1.0;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=None, min_samples_split=1.0;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.1;, score=0.909 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.1;, score=0.926 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.1;, score=0.906 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.1;, score=0.882 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.1;, score=0.901 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.2;, score=0.901 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.2;, score=0.904 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.2;, score=0.887 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.2;, score=0.872 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.2;, score=0.889 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.30000000000000004;, score=0.880 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.30000000000000004;, score=0.876 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.30000000000000004;, score=0.870 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.4;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.4;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.4;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.4;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.4;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.5;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.5;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.5;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.5;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.5;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.6;, score=0.797 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.6;, score=0.813 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.6;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.7000000000000001;, score=0.750 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.7000000000000001;, score=0.753 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.7000000000000001;, score=0.745 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.7000000000000001;, score=0.728 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.7000000000000001;, score=0.752 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.8;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.8;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.8;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.8;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.8;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=0.9;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=0.9;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=0.9;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=0.9;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=0.9;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=5, min_samples_split=1.0;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=5, min_samples_split=1.0;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=5, min_samples_split=1.0;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=5, min_samples_split=1.0;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=5, min_samples_split=1.0;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.1;, score=0.936 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.1;, score=0.947 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.1;, score=0.935 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.1;, score=0.925 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.1;, score=0.937 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.2;, score=0.916 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.2;, score=0.914 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.2;, score=0.897 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.2;, score=0.881 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.2;, score=0.895 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.30000000000000004;, score=0.880 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.30000000000000004;, score=0.876 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.30000000000000004;, score=0.870 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.4;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.4;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.4;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.4;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.4;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.5;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.5;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.5;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.5;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.5;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.6;, score=0.797 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.6;, score=0.813 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.6;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.7000000000000001;, score=0.750 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.7000000000000001;, score=0.753 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.7000000000000001;, score=0.745 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.7000000000000001;, score=0.728 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.7000000000000001;, score=0.752 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.8;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.8;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.8;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.8;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.8;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=0.9;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=0.9;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=0.9;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=0.9;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=0.9;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=10, min_samples_split=1.0;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=10, min_samples_split=1.0;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=10, min_samples_split=1.0;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=10, min_samples_split=1.0;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=10, min_samples_split=1.0;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.1;, score=0.937 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.1;, score=0.947 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.1;, score=0.935 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.1;, score=0.926 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.1;, score=0.937 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.2;, score=0.916 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.2;, score=0.914 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.2;, score=0.897 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.2;, score=0.881 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.2;, score=0.895 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.30000000000000004;, score=0.880 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.30000000000000004;, score=0.876 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.30000000000000004;, score=0.849 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.30000000000000004;, score=0.870 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.4;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.4;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.4;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.4;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.4;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.5;, score=0.828 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.5;, score=0.852 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.5;, score=0.850 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.5;, score=0.821 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.5;, score=0.846 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.6;, score=0.797 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.6;, score=0.813 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.6;, score=0.782 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.6;, score=0.814 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.7000000000000001;, score=0.750 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.7000000000000001;, score=0.753 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.7000000000000001;, score=0.745 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.7000000000000001;, score=0.728 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.7000000000000001;, score=0.752 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.8;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.8;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.8;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.8;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.8;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=0.9;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=0.9;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=0.9;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=0.9;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=0.9;, score=0.677 total time=   0.0s
[CV 1/5] END max_depth=20, min_samples_split=1.0;, score=0.692 total time=   0.0s
[CV 2/5] END max_depth=20, min_samples_split=1.0;, score=0.684 total time=   0.0s
[CV 3/5] END max_depth=20, min_samples_split=1.0;, score=0.670 total time=   0.0s
[CV 4/5] END max_depth=20, min_samples_split=1.0;, score=0.654 total time=   0.0s
[CV 5/5] END max_depth=20, min_samples_split=1.0;, score=0.677 total time=   0.0s
Grid Search Results:
{'mean_fit_time': array([0.01793795, 0.01365795, 0.01034584, 0.00998201, 0.00968375,
       0.00919371, 0.00805531, 0.00647502, 0.0070219 , 0.00626655,
       0.01232347, 0.0136539 , 0.01040902, 0.01022491, 0.0096272 ,
       0.00876679, 0.00722232, 0.00549202, 0.00660048, 0.00576396,
       0.01803284, 0.01193585, 0.0098424 , 0.00931005, 0.01038899,
       0.01052971, 0.007619  , 0.00622945, 0.00554309, 0.00584545,
       0.01788516, 0.01276417, 0.01033468, 0.00934911, 0.01064281,
       0.00888958, 0.00868831, 0.00622072, 0.00687342, 0.00634537]), 'std_fit_time': array([3.62540687e-04, 1.71024030e-03, 1.01081223e-04, 1.11673406e-03,
       1.27170808e-03, 7.63725939e-04, 1.49305824e-04, 1.99452747e-04,
       1.54245421e-03, 1.87524990e-04, 2.46636506e-04, 2.95196485e-03,
       9.58808668e-05, 1.32373407e-03, 9.73743738e-04, 1.41903992e-03,
       1.17829965e-03, 5.58801260e-04, 2.13676575e-03, 1.06090578e-04,
       1.51620185e-03, 7.95138087e-04, 5.26545087e-04, 9.60911700e-04,
       1.98275636e-03, 1.47889253e-03, 1.13491411e-04, 1.17696093e-03,
       3.73297859e-04, 4.25120226e-04, 1.66347640e-03, 1.45052742e-03,
       5.99887647e-04, 4.91291528e-04, 1.82251867e-03, 1.04767953e-03,
       1.41683714e-03, 2.39986301e-04, 2.06297308e-03, 1.44355124e-04]), 'mean_score_time': array([0.00425887, 0.00460052, 0.00396752, 0.0041244 , 0.00371642,
       0.00424824, 0.00365424, 0.00349264, 0.00338583, 0.00340829,
       0.00385275, 0.0056107 , 0.00375853, 0.00411191, 0.00398917,
       0.00373616, 0.00342808, 0.00321245, 0.00329723, 0.00325661,
       0.00421238, 0.00350823, 0.00349979, 0.00410066, 0.00451164,
       0.00417023, 0.00349693, 0.00371151, 0.00357375, 0.00414972,
       0.00407476, 0.0041286 , 0.00368938, 0.00346031, 0.00374012,
       0.00433202, 0.00398989, 0.00332532, 0.00347228, 0.0040482 ]), 'std_score_time': array([1.01988658e-04, 1.11765744e-03, 7.85852983e-04, 5.77430918e-04,
       2.74236317e-04, 8.97871689e-04, 1.14345312e-04, 2.60880542e-04,
       8.86057109e-05, 1.10005888e-04, 1.75618200e-04, 2.24679730e-03,
       1.19480454e-04, 6.35661323e-04, 9.31621278e-04, 8.36535971e-04,
       1.44275244e-04, 1.07882128e-04, 1.82658920e-04, 2.14657419e-04,
       7.08532701e-04, 1.22344986e-04, 1.21192120e-04, 1.03854251e-03,
       1.28388397e-03, 6.96487628e-04, 9.01197504e-05, 8.37110432e-04,
       4.14038026e-04, 1.51447250e-03, 9.25956374e-05, 6.65205244e-04,
       1.85844650e-04, 6.95983976e-05, 2.17614520e-04, 1.60505438e-03,
       8.79053710e-04, 1.27217401e-04, 3.00229201e-04, 1.15972513e-03]), 'param_max_depth': masked_array(data=[None, None, None, None, None, None, None, None, None,
                   None, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 10, 10, 10, 10, 10,
                   10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20,
                   20],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'param_min_samples_split': masked_array(data=[0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6,
                   0.7000000000000001, 0.8, 0.9, 1.0, 0.1, 0.2,
                   0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001,
                   0.8, 0.9, 1.0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5,
                   0.6, 0.7000000000000001, 0.8, 0.9, 1.0, 0.1, 0.2,
                   0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001,
                   0.8, 0.9, 1.0],
             mask=[False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False,
                   False, False, False, False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'max_depth': None, 'min_samples_split': 0.1}, {'max_depth': None, 'min_samples_split': 0.2}, {'max_depth': None, 'min_samples_split': 0.30000000000000004}, {'max_depth': None, 'min_samples_split': 0.4}, {'max_depth': None, 'min_samples_split': 0.5}, {'max_depth': None, 'min_samples_split': 0.6}, {'max_depth': None, 'min_samples_split': 0.7000000000000001}, {'max_depth': None, 'min_samples_split': 0.8}, {'max_depth': None, 'min_samples_split': 0.9}, {'max_depth': None, 'min_samples_split': 1.0}, {'max_depth': 5, 'min_samples_split': 0.1}, {'max_depth': 5, 'min_samples_split': 0.2}, {'max_depth': 5, 'min_samples_split': 0.30000000000000004}, {'max_depth': 5, 'min_samples_split': 0.4}, {'max_depth': 5, 'min_samples_split': 0.5}, {'max_depth': 5, 'min_samples_split': 0.6}, {'max_depth': 5, 'min_samples_split': 0.7000000000000001}, {'max_depth': 5, 'min_samples_split': 0.8}, {'max_depth': 5, 'min_samples_split': 0.9}, {'max_depth': 5, 'min_samples_split': 1.0}, {'max_depth': 10, 'min_samples_split': 0.1}, {'max_depth': 10, 'min_samples_split': 0.2}, {'max_depth': 10, 'min_samples_split': 0.30000000000000004}, {'max_depth': 10, 'min_samples_split': 0.4}, {'max_depth': 10, 'min_samples_split': 0.5}, {'max_depth': 10, 'min_samples_split': 0.6}, {'max_depth': 10, 'min_samples_split': 0.7000000000000001}, {'max_depth': 10, 'min_samples_split': 0.8}, {'max_depth': 10, 'min_samples_split': 0.9}, {'max_depth': 10, 'min_samples_split': 1.0}, {'max_depth': 20, 'min_samples_split': 0.1}, {'max_depth': 20, 'min_samples_split': 0.2}, {'max_depth': 20, 'min_samples_split': 0.30000000000000004}, {'max_depth': 20, 'min_samples_split': 0.4}, {'max_depth': 20, 'min_samples_split': 0.5}, {'max_depth': 20, 'min_samples_split': 0.6}, {'max_depth': 20, 'min_samples_split': 0.7000000000000001}, {'max_depth': 20, 'min_samples_split': 0.8}, {'max_depth': 20, 'min_samples_split': 0.9}, {'max_depth': 20, 'min_samples_split': 1.0}], 'split0_test_score': array([0.93730181, 0.9160884 , 0.84869493, 0.82794936, 0.82794936,
       0.79748725, 0.75029107, 0.69175893, 0.69175893, 0.69175893,
       0.9089714 , 0.90057156, 0.84869493, 0.82794936, 0.82794936,
       0.79748725, 0.75029107, 0.69175893, 0.69175893, 0.69175893,
       0.93634074, 0.9160884 , 0.84869493, 0.82794936, 0.82794936,
       0.79748725, 0.75029107, 0.69175893, 0.69175893, 0.69175893,
       0.93730181, 0.9160884 , 0.84869493, 0.82794936, 0.82794936,
       0.79748725, 0.75029107, 0.69175893, 0.69175893, 0.69175893]), 'split1_test_score': array([0.94746502, 0.91443934, 0.88002498, 0.8516226 , 0.8516226 ,
       0.81261881, 0.75315735, 0.68368509, 0.68368509, 0.68368509,
       0.92617541, 0.90427189, 0.88002498, 0.8516226 , 0.8516226 ,
       0.81261881, 0.75315735, 0.68368509, 0.68368509, 0.68368509,
       0.94746502, 0.91443934, 0.88002498, 0.8516226 , 0.8516226 ,
       0.81261881, 0.75315735, 0.68368509, 0.68368509, 0.68368509,
       0.94746502, 0.91443934, 0.88002498, 0.8516226 , 0.8516226 ,
       0.81261881, 0.75315735, 0.68368509, 0.68368509, 0.68368509]), 'split2_test_score': array([0.93467129, 0.89741545, 0.8758676 , 0.84978872, 0.84978872,
       0.81408353, 0.74532048, 0.67030394, 0.67030394, 0.67030394,
       0.9056396 , 0.88703396, 0.8758676 , 0.84978872, 0.84978872,
       0.81408353, 0.74532048, 0.67030394, 0.67030394, 0.67030394,
       0.93467129, 0.89741545, 0.8758676 , 0.84978872, 0.84978872,
       0.81408353, 0.74532048, 0.67030394, 0.67030394, 0.67030394,
       0.93467129, 0.89741545, 0.8758676 , 0.84978872, 0.84978872,
       0.81408353, 0.74532048, 0.67030394, 0.67030394, 0.67030394]), 'split3_test_score': array([0.92551228, 0.88086844, 0.84915574, 0.8209622 , 0.8209622 ,
       0.781683  , 0.72827203, 0.65426159, 0.65426159, 0.65426159,
       0.88207967, 0.87152009, 0.84915574, 0.8209622 , 0.8209622 ,
       0.781683  , 0.72827203, 0.65426159, 0.65426159, 0.65426159,
       0.92505621, 0.88086844, 0.84915574, 0.8209622 , 0.8209622 ,
       0.781683  , 0.72827203, 0.65426159, 0.65426159, 0.65426159,
       0.92551228, 0.88086844, 0.84915574, 0.8209622 , 0.8209622 ,
       0.781683  , 0.72827203, 0.65426159, 0.65426159, 0.65426159]), 'split4_test_score': array([0.93748886, 0.89481142, 0.86989733, 0.84612023, 0.84612023,
       0.81373086, 0.75236095, 0.67690594, 0.67690594, 0.67690594,
       0.90133215, 0.88854737, 0.86989733, 0.84612023, 0.84612023,
       0.81373086, 0.75236095, 0.67690594, 0.67690594, 0.67690594,
       0.93747189, 0.89481142, 0.86989733, 0.84612023, 0.84612023,
       0.81373086, 0.75236095, 0.67690594, 0.67690594, 0.67690594,
       0.93748886, 0.89481142, 0.86989733, 0.84612023, 0.84612023,
       0.81373086, 0.75236095, 0.67690594, 0.67690594, 0.67690594]), 'mean_test_score': array([0.93648785, 0.90072461, 0.86472812, 0.83928862, 0.83928862,
       0.80392069, 0.74588038, 0.6753831 , 0.6753831 , 0.6753831 ,
       0.90483965, 0.89038897, 0.86472812, 0.83928862, 0.83928862,
       0.80392069, 0.74588038, 0.6753831 , 0.6753831 , 0.6753831 ,
       0.93620103, 0.90072461, 0.86472812, 0.83928862, 0.83928862,
       0.80392069, 0.74588038, 0.6753831 , 0.6753831 , 0.6753831 ,
       0.93648785, 0.90072461, 0.86472812, 0.83928862, 0.83928862,
       0.80392069, 0.74588038, 0.6753831 , 0.6753831 , 0.6753831 ]), 'std_test_score': array([0.00701321, 0.0131478 , 0.01329936, 0.01243774, 0.01243774,
       0.01273644, 0.00921709, 0.01273833, 0.01273833, 0.01273833,
       0.01416506, 0.01155078, 0.01329936, 0.01243774, 0.01243774,
       0.01273644, 0.00921709, 0.01273833, 0.01273833, 0.01273833,
       0.00714226, 0.0131478 , 0.01329936, 0.01243774, 0.01243774,
       0.01273644, 0.00921709, 0.01273833, 0.01273833, 0.01273833,
       0.00701321, 0.0131478 , 0.01329936, 0.01243774, 0.01243774,
       0.01273644, 0.00921709, 0.01273833, 0.01273833, 0.01273833]), 'rank_test_score': array([ 1,  5,  9, 13, 13, 21, 25, 29, 29, 29,  4,  8,  9, 13, 13, 21, 25,
       29, 29, 29,  3,  5,  9, 13, 13, 21, 25, 29, 29, 29,  1,  5,  9, 13,
       13, 21, 25, 29, 29, 29], dtype=int32)}
Best Parameters: {'max_depth': None, 'min_samples_split': 0.1}
Mean Test Scores: [0.93648785 0.90072461 0.86472812 0.83928862 0.83928862 0.80392069
 0.74588038 0.6753831  0.6753831  0.6753831  0.90483965 0.89038897
 0.86472812 0.83928862 0.83928862 0.80392069 0.74588038 0.6753831
 0.6753831  0.6753831  0.93620103 0.90072461 0.86472812 0.83928862
 0.83928862 0.80392069 0.74588038 0.6753831  0.6753831  0.6753831
 0.93648785 0.90072461 0.86472812 0.83928862 0.83928862 0.80392069
 0.74588038 0.6753831  0.6753831  0.6753831 ]
AUC for the Best Model: 0.9459130947003455

According to the output, the best decision tree parameters are a max_depth of none and a min_samples_split of 0.1. Also, the best AUC score possible is roughly 0.946. Given how to close it is to one, a score of 0.946 suggests that the decision tree model is effective at distinguishing between the positive and negative classes. Below, I further refined the decision tree using cross validation. The results show similar performance metrics from earlier, but now it also factors in the AUC. This ensures the selection of the best parameters possible.¶

In [ ]:
from sklearn.model_selection import cross_val_score
results = []

for max_depth in max_depths:
    for min_samples_split in min_samples_splits:

        model = DecisionTreeClassifier(max_depth=max_depth, min_samples_split=min_samples_split)

        auc_scores = cross_val_score(model, X, Y, cv=5, scoring='roc_auc')
        recall_scores = cross_val_score(model, X, Y, cv=5, scoring='recall')
        precision_scores = cross_val_score(model, X, Y, cv=5, scoring='precision')
        f1_scores = cross_val_score(model, X, Y, cv=5, scoring='f1')

        results.append({
            'max_depth': max_depth,
            'min_samples_split': min_samples_split,
            'AUC': np.mean(auc_scores),
            'Recall': np.mean(recall_scores),
            'Precision': np.mean(precision_scores),
            'F1': np.mean(f1_scores)
        })

df_results = pd.DataFrame(results)

print(df_results)
    max_depth  min_samples_split       AUC    Recall  Precision        F1
0         NaN                0.1  0.936488  0.870050   0.856195  0.862832
1         NaN                0.2  0.900725  0.708746   0.910136  0.795916
2         NaN                0.3  0.864728  0.830870   0.765836  0.796820
3         NaN                0.4  0.839289  0.830870   0.765836  0.796820
4         NaN                0.5  0.839289  0.830870   0.765836  0.796820
5         NaN                0.6  0.803921  0.830870   0.765836  0.796820
6         NaN                0.7  0.745880  0.893169   0.685405  0.775287
7         NaN                0.8  0.675383  0.938952   0.614532  0.742469
8         NaN                0.9  0.675383  0.938952   0.614532  0.742469
9         NaN                1.0  0.675383  0.938952   0.614532  0.742469
10        5.0                0.1  0.904836  0.937714   0.747749  0.831536
11        5.0                0.2  0.890389  0.870878   0.759751  0.811409
12        5.0                0.3  0.864728  0.830870   0.765836  0.796820
13        5.0                0.4  0.839289  0.830870   0.765836  0.796820
14        5.0                0.5  0.839289  0.830870   0.765836  0.796820
15        5.0                0.6  0.803921  0.830870   0.765836  0.796820
16        5.0                0.7  0.745880  0.893169   0.685405  0.775287
17        5.0                0.8  0.675383  0.938952   0.614532  0.742469
18        5.0                0.9  0.675383  0.938952   0.614532  0.742469
19        5.0                1.0  0.675383  0.938952   0.614532  0.742469
20       10.0                0.1  0.936201  0.870050   0.856195  0.862832
21       10.0                0.2  0.900725  0.708746   0.910136  0.795916
22       10.0                0.3  0.864728  0.830870   0.765836  0.796820
23       10.0                0.4  0.839289  0.830870   0.765836  0.796820
24       10.0                0.5  0.839289  0.830870   0.765836  0.796820
25       10.0                0.6  0.803921  0.830870   0.765836  0.796820
26       10.0                0.7  0.745880  0.893169   0.685405  0.775287
27       10.0                0.8  0.675383  0.938952   0.614532  0.742469
28       10.0                0.9  0.675383  0.938952   0.614532  0.742469
29       10.0                1.0  0.675383  0.938952   0.614532  0.742469
30       20.0                0.1  0.936488  0.870050   0.856195  0.862832
31       20.0                0.2  0.900725  0.708746   0.910136  0.795916
32       20.0                0.3  0.864728  0.830870   0.765836  0.796820
33       20.0                0.4  0.839289  0.830870   0.765836  0.796820
34       20.0                0.5  0.839289  0.830870   0.765836  0.796820
35       20.0                0.6  0.803921  0.830870   0.765836  0.796820
36       20.0                0.7  0.745880  0.893169   0.685405  0.775287
37       20.0                0.8  0.675383  0.938952   0.614532  0.742469
38       20.0                0.9  0.675383  0.938952   0.614532  0.742469
39       20.0                1.0  0.675383  0.938952   0.614532  0.742469

These results reinforce my findings that the best decision tree parameters are a max_depth of none and a min_samples_split of 0.1. The ROC plot below demonstrates the effectiveness of this model because the area equals 0.95. Considering its closeness to one, this model accomplishes its role well.¶

In [ ]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

best_max_depth = best_params['max_depth']
best_min_samples_split = best_params['min_samples_split']

best_model = DecisionTreeClassifier(max_depth=best_max_depth, min_samples_split=best_min_samples_split)

best_model.fit(X, Y)

y_pred_proba = best_model.predict_proba(X)[:, 1]

fpr, tpr, _ = roc_curve(Y, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()

Summary:¶

The decision tree model was optimized with parameters set to max_depth=None and min_samples_split=0.1. The former indicates an unrestricted growth potential, allowing the tree to expand until certain conditions are met, while the latter enforces a conservative node-splitting condition, requiring at least 10% of samples for a split. This model achieved notable success, with a high AUC score of approximately 0.95, signaling robust discriminatory power. The interpretation of features such as "NumDash" and "InsecureForms" contributed to the model's ability to identify phishing websites. This decision tree provide great guidance once I split the data into training and testing sets.¶

Phase V - Model Building and Prediction¶

After Phase IV, I revised my selection of the seven features to build the model. Instead of the seven I chose in Phase II & III, I decided to incorporate the features I found in Phase IV.¶

  • IpAddress
  • InsecureForms
  • NumQueryComponents
  • NumDashInHostname
  • NoHttps
  • PctExtResourceUrls
  • NumDots ### Most of these features were found to be the most influential. I replaced the original six features, but kept IpAddress. IP addresses are commonly associated with phishing so it is reasonable to keep it in the list. Out of every feature on the original list, IP address makes the most sense to keep. To start Phase V, I reduced the data frame down the new list of seven features. I also separated the target variable and made training and testing data sets.
In [244]:
Y=df[["CLASS_LABEL"]]
X=df[["InsecureForms","NumQueryComponents","NumDashInHostname","NoHttps","PctExtResourceUrls","NumDots","IpAddress"]]
In [245]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,random_state=1)

This phase focuses on finding and optimizing the best model. By comparing performance metrics from DecisionTreeClassifier, SVM, and Random Forests I will choose the optimal model for this task. First, I will begin with the DecisionTreeClassifier and look for the best parameters with KFold cross validation.¶

In [228]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

kf = KFold(n_splits=5, random_state=None, shuffle=True)

min_samples_splits = np.linspace(0.1, 1.0, 10, endpoint=True)
print(min_samples_splits)

avg_f1_test = []
avg_f1_train = []
avg_n_leaves = []

for mss in min_samples_splits:

    f1_train = []
    f1_test = []
    n_leaves = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        Y_train, Y_test = Y.iloc[train_index], Y.iloc[test_index]

        clf = tree.DecisionTreeClassifier(min_samples_split=mss)

        clf = clf.fit(X_train, Y_train)

        Y_test_predicted = clf.predict(X_test)
        Y_train_predicted = clf.predict(X_train)

        f1_test.append(f1_score(Y_test, Y_test_predicted, pos_label=0))
        f1_train.append(f1_score(Y_train, Y_train_predicted, pos_label=0))

        n_leaves.append(clf.get_n_leaves())

    avg_f1_test.append(np.mean(f1_test))
    avg_f1_train.append(np.mean(f1_train))
    avg_n_leaves.append(np.mean(n_leaves))
[0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]

Although min_samples_splits=0.1 may generate an F1 score of 0.82, it reduces the model's ability to generalize. 0.5 is the best parameter to reduce overfitting while not sacrificing the F1 score.¶

In [230]:
plt.figure(figsize=(4,4))
plt.plot(min_samples_splits,avg_f1_test,label='Testing Set')
plt.plot(min_samples_splits,avg_f1_train,label='Training Set')
plt.legend()
plt.xticks(min_samples_splits)
plt.grid(color='b', axis='x', linestyle='-.', linewidth=1,alpha=0.2)
plt.xlabel('Minimum Sample Split Fraction')
plt.ylabel('F1')
Out[230]:
Text(0, 0.5, 'F1')

In Phase IV I found the optimal max_depth so I skipped the optimization process for that here. I also, plotted the ROC to visualize the effectiveness of the model.¶

In [231]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

clf = tree.DecisionTreeClassifier(max_depth=None, min_samples_split=0.5)
clf.fit(X, Y)

Y_prob = clf.predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(Y, Y_prob)

roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC)')
plt.legend(loc='lower right')
plt.show()

This model performs well in most regards considering the constraints of this task.¶

In [234]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
accuracy = accuracy_score(Y, Y_pred)
precision = precision_score(Y, Y_pred)
recall = recall_score(Y, Y_pred)
f1 = f1_score(Y, Y_pred)
auc = roc_auc_score(Y, Y_prob)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print(f"AUC: {auc:.4f}")
Accuracy: 0.8163
Precision: 0.8260
Recall: 0.8007
F1 Score: 0.8132
AUC: 0.8005

Then, I performed a similar optimization process for SVM using grid search.¶

In [235]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import make_scorer, roc_auc_score

param_grid = {'C': [0.1, 1, 10, 20, 30]}

svm_classifier = SVC(kernel='rbf')

grid_search = GridSearchCV(svm_classifier, param_grid, cv=5, scoring="roc_auc")
grid_search.fit(X_train, Y_train)

print("Best C:", grid_search.best_params_['C'])

best_svm_classifier = SVC(C=grid_search.best_params_['C'], kernel='rbf')
best_svm_classifier.fit(X_train, Y_train)

Y_pred = best_svm_classifier.predict(X)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Best C: 20
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

Overall, SVM performs better than DecisionTreeClassifier.¶

In [236]:
clf = svm.SVC(C=20,kernel='rbf')
clf.fit(X, Y)
Y_pre=clf.predict(X)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

accuracy = accuracy_score(Y, Y_pred)
precision = precision_score(Y, Y_pred)
recall = recall_score(Y, Y_pred)
f1 = f1_score(Y, Y_pred)

Y_prob = clf.decision_function(X)
auc = roc_auc_score(Y, Y_prob)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
print('AUC Score:', auc)
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Accuracy: 0.8138768787317274
Precision: 0.820675105485232
Recall: 0.8023927392739274
F1 Score: 0.8114309553608678
AUC Score: 0.9050807516016706

Finally, I tried Random Forests. Random Forests performed better than the other models in every aspect. For this reason, I will use Random Forest moving forward.¶

In [237]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

rf = RandomForestClassifier()

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='roc_auc', verbose=2, n_jobs=-1)
grid_search.fit(X_train, Y_train)

best_params = grid_search.best_params_
print("Best Parameters:", best_params)

best_rf = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf']
)
best_rf.fit(X_train, Y_train)

Y_pred = best_rf.predict(X_test)
Y_prob = best_rf.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(Y_test, Y_pred)
precision = precision_score(Y_test, Y_pred)
recall = recall_score(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred)
auc = roc_auc_score(Y_test, Y_prob)

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F1 Score:', f1)
print('AUC Score:', auc)
Fitting 5 folds for each of 108 candidates, totalling 540 fits
/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_search.py:909: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self.best_estimator_.fit(X, y, **fit_params)
Best Parameters: {'max_depth': 30, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 150}
<ipython-input-237-07315e3d4f15>:26: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  best_rf.fit(X_train, Y_train)
Accuracy: 0.8578784757981462
Precision: 0.860655737704918
Recall: 0.8571428571428571
F1 Score: 0.8588957055214723
AUC Score: 0.931895710467139

Compared to DecisionTreeClassifier, Random Forests has a higher AUC, accuracy, precision, recall, and F1 Score.¶

In [238]:
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

fpr, tpr, thresholds = roc_curve(Y_test, Y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()

To further test the model's performance I used a confusion matrix to measure the models predictive abilities. Random Forests generates a small amount of false negatives and false positives. This leads to a type I error of 0.141 and type II error of 0.143. Stastically, this model outperforms every other model.¶

In [239]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(Y_test, Y_pred)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_rf.classes_)
disp.plot(cmap='Blues', values_format='d')
plt.title('Confusion Matrix')
plt.show()

After finding the ideal model and optimizing it to its full potential, I can import and apply it to the testing data set.¶

In [242]:
tdf=pd.read_csv('Phishing_Legitimate_test_student (2).csv',index_col='id',na_values=['',' ','n/a','null'])
tdf.head()
Out[242]:
NumDots SubdomainLevel PathLevel UrlLength NumDash NumDashInHostname AtSymbol TildeSymbol NumUnderscore NumPercent ... ExtFavicon InsecureForms RelativeFormAction ExtFormAction AbnormalFormAction RightClickDisabled PopUpWindow IframeOrFrame MissingTitle ImagesOnlyInForm
id
1 6 1 2 59 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
2 2 1 3 76 4 0 0 0 0 0 ... 0 1 1 0 1 0 0 1 0 0
3 3 1 1 59 0 0 0 0 3 0 ... 0 0 0 0 0 0 0 1 0 0
4 5 1 3 67 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
5 2 0 4 88 3 0 0 0 0 0 ... 0 1 1 0 0 0 0 1 0 0

5 rows × 37 columns

In [243]:
tdf=tdf[["InsecureForms","NumQueryComponents","NumDashInHostname","NoHttps","PctExtResourceUrls","NumDots","IpAddress"]]
tdf.head()
Out[243]:
InsecureForms NumQueryComponents NumDashInHostname NoHttps PctExtResourceUrls NumDots IpAddress
id
1 1 0 0 1 1.00000 6 0
2 1 0 0 1 1.00000 2 0
3 0 0 0 1 0.87500 3 0
4 1 0 0 1 0.15625 5 0
5 1 0 0 1 0.00000 2 0
In [246]:
best_params = {
    'n_estimators': 150,
    'max_depth': 30,
    'min_samples_split': 10,
    'min_samples_leaf': 4
}

best_rf = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    min_samples_leaf=best_params['min_samples_leaf']
)

best_rf.fit(X,Y)
<ipython-input-246-d34dca0acf94>:15: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  best_rf.fit(X,Y)
Out[246]:
RandomForestClassifier(max_depth=30, min_samples_leaf=4, min_samples_split=10,
                       n_estimators=150)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=30, min_samples_leaf=4, min_samples_split=10,
                       n_estimators=150)

After implementing the optimal parameters, I applied the model to the testing set. Finally, I transferred the predicted values to a csv file.¶

In [248]:
Y_pred = best_rf.predict(tdf)
print(Y_pred)
[1 1 0 ... 0 1 0]
In [249]:
data = {
     "Prediction": Y_pred,
}
final = pd.DataFrame(data, index=[tdf.index])
final
Out[249]:
Prediction
id
1 1
2 1
3 0
4 1
5 1
... ...
4996 0
4997 0
4998 0
4999 1
5000 0

5000 rows × 1 columns

In [250]:
final.to_csv("Predictions.csv")

Analysis¶

Phishing sites often attempt to disguise their true identity by using IP addresses instead of legitimate domain names. The presence of an IP address in the URL may indicate a suspicious attempt to avoid traditional domain-based checks, as legitimate websites typically use domain names for easier recognition. Monitoring and analyzing IP addresses can be crucial in identifying potential phishing attempts.¶

Phishing attacks commonly involve the collection of sensitive information through deceptive forms. The feature "InsecureForms" likely indicates whether a website employs secure (HTTPS) or insecure (HTTP) forms. Phishers often exploit insecure forms to capture login credentials, personal details, or financial information. Identifying websites with insecure forms is essential for recognizing potential threats to user data security.¶

The presence of dashes in the hostname can be an indicator of a suspicious domain. Phishers may use variations of legitimate domain names by adding dashes to create confusion and trick users into believing they are interacting with a trusted source. Monitoring the "NumDashInHostname" feature is valuable for detecting potential domain name manipulation.¶

Executive Summary¶

In the comprehensive evaluation of various machine learning models for the detection of phishing websites, the Random Forest classifier emerged as the standout performer, surpassing its counterparts in terms of predictive accuracy and robustness. Random Forests exhibited a remarkable accuracy rate, outperforming other models across multiple evaluation metrics. The model's precision, recall, and F1 score consistently demonstrated superior performance, showcasing its ability to effectively distinguish between legitimate and phishing websites. This high level of accuracy is crucial in the context of cybersecurity, where the cost of false positives and false negatives can have severe consequences.¶

One of the key strengths of Random Forests lies in its ensemble learning approach, which harnesses the collective intelligence of multiple decision trees. This ensemble method mitigates overfitting, enhances generalization, and ensures a robust performance across diverse datasets. The model's capacity to handle complex relationships within the feature space and adapt to changing patterns in phishing tactics further solidifies its position as a top-performing solution.¶

Additionally, the Random Forest classifier demonstrated notable efficiency in handling diverse feature sets related to URL characteristics, IP addresses, and security attributes. Its ability to discern intricate patterns within these features contributed significantly to its exceptional predictive power. This adaptability is paramount in an ever-evolving threat landscape where phishing techniques continually evolve.¶

In conclusion, the Random Forest model not only outshone its peers in terms of accuracy and performance metrics but also showcased resilience and adaptability in tackling the nuanced challenges posed by the detection of phishing websites. Its ensemble learning paradigm, coupled with a robust feature set, positions Random Forests as a highly effective and reliable solution for organizations seeking state-of-the-art cybersecurity measures against phishing threats.¶

In [ ]: